Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance
Authors
Abstract
Many emerging database applications require accurate selectivity estimation for approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating the selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semilattice and string hierarchy, together with a combinatorial analysis, we derive formulas for selectivity estimation and provide the algorithm BasicEQ. We then develop the algorithm OptEQ by enhancing BasicEQ with two novel improvements. Finally, we present a comprehensive set of experiments on three benchmarks comparing OptEQ with the state-of-the-art method SEPIA. Our experimental results show that OptEQ delivers more accurate selectivity estimates.
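To make the estimated quantity concrete, the following is a minimal sketch of the exact (brute-force) selectivity that an estimator such as BasicEQ or OptEQ would approximate. It is not the paper's method: the function names `edit_distance` and `exact_selectivity` and the sample strings are assumptions introduced here for illustration only.

```python
from typing import Sequence


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_selectivity(query: str, strings: Sequence[str], k: int) -> float:
    """Fraction of `strings` within edit distance k of `query` (hypothetical helper)."""
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= k)
    return matches / len(strings)


# Illustrative data, not taken from the paper's benchmarks.
names = ["smith", "smyth", "smithe", "jones"]
print(exact_selectivity("smith", names, k=1))
```

Scanning every string this way costs time proportional to the size of the collection, which is why a query optimizer needs a compact estimator instead of the exact count.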
Similar Resources
Approximate String Joins in a Database (Almost) for Free Erratum
In [GIJ01a, GIJ01b] we described how to use q-grams in an RDBMS to perform approximate string joins. We also showed how to implement the approximate join using plain SQL queries. Specifically, we described three filters, count filter, position filter, and length filter, which can be used to execute efficiently the approximate join. The intuition behind the count filter was that strings that are...
Using q-grams in a DBMS for Approximate String Processing
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a ...
Nikolaus Augsten Approximate Matching of Hierarchical Data
The goal of this thesis is to design, develop, and evaluate new methods for the approximate matching of hierarchical data represented as labeled trees. In approximate matching scenarios two items should be matched if they are similar. Computing the similarity between labeled trees is hard as in addition to the data values also the structure must be considered. A well-known measure for comparing...
Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation
Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, like edit distance or n-grams. The Jaccard coefficient has traditionally been used as an s-gram based string proxi...
VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a nov...